R. J. Serrano
03/10/2023
Learning objectives:
Develop linear models with time series.
Developing useful predictors for linear models.
Residual diagnostics.
Selecting predictors and forecast evaluation.
Forecasting with regression.
Correlation, causation and forecasting.
Matrix formulation (refer to textbook Chapter 7.9).
In this chapter, we will discuss regression models. The basic concept is that we forecast the time series of interest \(y\) assuming that it has a linear relationship with other time series \(x\).
In the simplest case, the regression model allows for a linear relationship between the forecast variable \(y\) and a single predictor \(x\):
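In the textbook's notation, with \(\varepsilon_t\) denoting the error term, this simple linear model is

\[ y_t = \beta_0 + \beta_1 x_t + \varepsilon_t, \]

where \(\beta_0\) is the intercept (the predicted value of \(y\) when \(x = 0\)) and \(\beta_1\) is the slope (the average change in \(y\) for a one-unit change in \(x\)).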
Figure 7.1: An example of data from a simple linear regression model
Let’s plot the time series of quarterly changes (growth rates) of real personal consumption expenditure (\(y\)) and real personal disposable income (\(x\)) for the U.S. from 1970 Q1 to 2019 Q2.
us_change %>%
pivot_longer(c(Consumption, Income),
names_to = 'Series') %>%
autoplot(value) +
labs(y = '% change')
Using the lm base function to calculate intercept and slope coefficients.
us_chg_tbl <- us_change %>%
as_tibble() %>%
select(Consumption, Income)
model_lm <- lm(Consumption ~ Income, data = us_chg_tbl)
summary(model_lm)
##
## Call:
## lm(formula = Consumption ~ Income, data = us_chg_tbl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.58236 -0.27777 0.01862 0.32330 1.42229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.54454 0.05403 10.079 < 2e-16 ***
## Income 0.27183 0.04673 5.817 2.4e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5905 on 196 degrees of freedom
## Multiple R-squared: 0.1472, Adjusted R-squared: 0.1429
## F-statistic: 33.84 on 1 and 196 DF, p-value: 2.402e-08
Scatterplot with fitted regression line.
us_change |>
ggplot(aes(x = Income, y = Consumption)) +
labs(y = "Consumption (quarterly % change)",
x = "Income (quarterly % change)") +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Cross-correlation between the two time series.
The cross-correlation plot can be used to identify lags of \(x\) that may be useful predictors of \(y\).
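As a sketch, base R's ccf() computes the cross-correlation at each lag. Simulated series stand in for us_change here so the snippet is self-contained; in practice you would pass us_change$Income and us_change$Consumption.

```r
# Sketch: cross-correlation between two simulated quarterly series.
# 'consumption' is built with a lag-1 dependence on 'income' so the
# CCF has something to find; substitute the real columns in practice.
set.seed(123)
income <- rnorm(200)
consumption <- 0.3 * c(0, head(income, -1)) + rnorm(200, sd = 0.5)

cc <- ccf(income, consumption, lag.max = 8, plot = FALSE)
print(cc)  # lags with large |correlation| suggest useful lagged predictors
```

Setting plot = FALSE returns the correlations for inspection; dropping it draws the usual CCF plot.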
us_change |>
# drop `Consumption`, `Income` columns
select(-Consumption, -Income) |>
pivot_longer(-Quarter) |>
ggplot(aes(Quarter, value, colour = name)) +
geom_line() +
facet_grid(name ~ ., scales = "free_y") +
guides(colour = "none") +
labs(y="% change")
Pair plots for us_change
library(GGally)
us_change %>%
ggpairs(columns = 2:6,
lower = list(
continuous = wrap('smooth',
color = 'steelblue',
alpha = 0.3,
size = 0.3)
)
)
The linear regression model makes the following assumptions about the errors:
They have mean zero; otherwise the forecasts will be systematically biased.
They are not autocorrelated; otherwise the forecasts will be inefficient, as there is more information in the data that can be exploited.
They are unrelated to the predictor variables; otherwise there would be more information that should be included in the systematic part of the model.
The residuals follow a Gaussian distribution (normal) with a constant variance (\(\sigma^{2}\)).
The least squares principle provides a way of choosing the coefficients effectively by minimising the sum of the squared errors.
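Concretely, for a model with \(k\) predictors, least squares chooses the coefficients \(\beta_0, \beta_1, \dots, \beta_k\) that minimise

\[ \sum_{t=1}^{T} \varepsilon_t^{2} = \sum_{t=1}^{T} \left( y_t - \beta_0 - \beta_1 x_{1,t} - \cdots - \beta_k x_{k,t} \right)^{2}. \]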
Example: US consumption expenditure
fit_consMR <- us_change |>
model(tslm = TSLM(Consumption ~ Income + Production +
Unemployment + Savings))
report(fit_consMR)
## Series: Consumption
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90555 -0.15821 -0.03608 0.13618 1.15471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.253105 0.034470 7.343 5.71e-12 ***
## Income 0.740583 0.040115 18.461 < 2e-16 ***
## Production 0.047173 0.023142 2.038 0.0429 *
## Unemployment -0.174685 0.095511 -1.829 0.0689 .
## Savings -0.052890 0.002924 -18.088 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3102 on 193 degrees of freedom
## Multiple R-squared: 0.7683, Adjusted R-squared: 0.7635
## F-statistic: 160 on 4 and 193 DF, p-value: < 2.22e-16
augment(fit_consMR) |>
ggplot(aes(x = Quarter)) +
geom_line(aes(y = Consumption, colour = "Data")) +
geom_line(aes(y = .fitted, colour = "Fitted")) +
labs(y = NULL,
title = "Percent change in US consumption expenditure"
) +
scale_colour_manual(values=c(Data="black",Fitted="#D55E00")) +
guides(colour = guide_legend(title = NULL))
augment(fit_consMR) |>
ggplot(aes(x = Consumption, y = .fitted)) +
geom_point() +
labs(
y = "Fitted (predicted values)",
x = "Data (actual values)",
title = "Percent change in US consumption expenditure"
) +
geom_abline(intercept = 0, slope = 1, color = 'red')
A common way to summarise how well a linear regression model fits the data is via the coefficient of determination, or \(R^{2}\). For the example above, \(R^{2} = 0.768\).
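For reference, \(R^{2}\) is the square of the correlation between the observed and fitted values, or equivalently

\[ R^{2} = \frac{\sum \left( \hat{y}_t - \bar{y} \right)^{2}}{\sum \left( y_t - \bar{y} \right)^{2}}, \]

the proportion of variation in the forecast variable accounted for by the regression.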
Residual plots
Using the gg_tsresiduals() function introduced in Section 5.3, we can obtain all the useful residual diagnostics mentioned above.
Shapiro-Wilk Normality Test
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.96039, p-value = 2.397e-05
The Shapiro-Wilk test is a hypothesis test for normality. The null hypothesis is that the residuals follow a Gaussian (normal) distribution. If the p-value < 0.05, we reject the null hypothesis in favour of the alternative that the residuals are not normally distributed. Here p = 2.397e-05, so the normality assumption is rejected.
Another way to test the residuals assumption is with a Q-Q plot.
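A sketch of both checks with base R, using simulated residuals so the snippet is self-contained; in practice pass the model residuals (e.g. residuals(model_lm) or the .resid column from augment(fit_consMR)).

```r
# Sketch: normality diagnostics on a vector of residuals.
# 'res' is simulated stand-in data; substitute real residuals in practice.
set.seed(1)
res <- rnorm(150)

shapiro.test(res)   # H0: residuals are normally distributed

qqnorm(res)         # Q-Q plot: points lying near the reference line
qqline(res)         # support the normality assumption
```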
Ljung-Box Test (residuals independently distributed)
The time plot shows some changing variation over time, but is otherwise relatively unremarkable. This heteroscedasticity will potentially make the prediction interval coverage inaccurate.
The histogram shows that the residuals seem to be slightly skewed, which may also affect the coverage probability of the prediction intervals.
The autocorrelation plot shows a significant spike at lag 7, and a significant Ljung-Box test at the 5% level. However, the autocorrelation is not particularly large, and at lag 7 it is unlikely to have any noticeable impact on the forecasts or the prediction intervals.
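The Ljung-Box test itself can be sketched with base R's Box.test(); the series here is simulated white noise standing in for model residuals.

```r
# Sketch: Ljung-Box test for residual autocorrelation.
# 'res' is simulated white noise; substitute real residuals in practice.
set.seed(2)
res <- rnorm(150)

Box.test(res, lag = 10, type = "Ljung-Box")
# A large p-value means no evidence of autocorrelation,
# i.e. the residuals are consistent with white noise.
```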
The residuals from the multiple regression model for forecasting US consumption plotted against each predictor in Figure 7.9 seem to be randomly scattered. Therefore we are satisfied with these in this case.
us_change |>
left_join(residuals(fit_consMR), by = "Quarter") |>
pivot_longer(Income:Unemployment,
names_to = "regressor", values_to = "x") |>
ggplot(aes(x = x, y = .resid)) +
geom_point() +
facet_wrap(. ~ regressor, scales = "free_x") +
labs(y = "Residuals", x = "")
A plot of the residuals against the fitted values should also show no pattern. If a pattern is observed, there may be “heteroscedasticity” in the errors, which means that the variance of the residuals may not be constant.
augment(fit_consMR) |>
ggplot(aes(x = .fitted, y = .resid)) +
geom_point() + labs(x = "Fitted", y = "Residuals")
Observations that take extreme values compared to the majority of the data are called outliers. Observations that have a large influence on the estimated coefficients of a regression model are called influential observations. Usually, influential observations are also outliers that are extreme in the \(x\) direction.
Trend
Dummy variables (ex. public/observed holidays)
Seasonal dummy variables
recent_production <- aus_production |>
filter(year(Quarter) >= 1992)
recent_production |>
autoplot(Beer) +
geom_smooth(method = 'loess', se = FALSE, color = 'steelblue') +
labs(y = "Megalitres",
title = "Australian quarterly beer production")
We want to forecast future beer production. We can model this data using a regression model with a linear trend and quarterly dummy variables.
fit_beer <- recent_production |>
model(TSLM(Beer ~ trend() + season()))
report(fit_beer)
## Series: Beer
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.9029 -7.5995 -0.4594 7.9908 21.7895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 441.80044 3.73353 118.333 < 2e-16 ***
## trend() -0.34027 0.06657 -5.111 2.73e-06 ***
## season()year2 -34.65973 3.96832 -8.734 9.10e-13 ***
## season()year3 -17.82164 4.02249 -4.430 3.45e-05 ***
## season()year4 72.79641 4.02305 18.095 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.23 on 69 degrees of freedom
## Multiple R-squared: 0.9243, Adjusted R-squared: 0.9199
## F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16
Note that trend() and season() are not standard functions; they are “special” functions that work within the TSLM() model formulae.
augment(fit_beer) |>
ggplot(aes(x = Quarter)) +
geom_line(aes(y = Beer, colour = "Data")) +
geom_line(aes(y = .fitted, colour = "Fitted")) +
scale_colour_manual(
values = c(Data = "black", Fitted = "#D55E00")
) +
labs(y = "Megalitres",
title = "Australian quarterly beer production") +
guides(colour = guide_legend(title = "Series"))
Plotly version
library(plotly)
g <- augment(fit_beer) %>%
select(Quarter, Beer, .fitted) %>%
as_tibble() %>%
pivot_longer(-Quarter) %>%
ggplot(aes(Quarter, value, group = name, color = name)) +
geom_line(linewidth = 0.5) +
labs(x = NULL,
y = "Megalitres",
title = "Australian quarterly beer production",
color = 'Series')
ggplotly(g)
augment(fit_beer) |>
ggplot(aes(x = Beer, y = .fitted,
colour = factor(quarter(Quarter)))) +
geom_point() +
labs(y = "Fitted", x = "Actual values",
title = "Australian quarterly beer production") +
geom_abline(intercept = 0, slope = 1) +
guides(colour = guide_legend(title = "Quarter"))
Intervention variables
Trading days
Distributed lags
Easter (example of holiday that date changes each year)
An alternative to using seasonal dummy variables, especially for long seasonal periods, is to use Fourier terms.
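With seasonal period \(m\), the first few Fourier terms are

\[ x_{1,t} = \sin\!\left(\frac{2\pi t}{m}\right),\quad x_{2,t} = \cos\!\left(\frac{2\pi t}{m}\right),\quad x_{3,t} = \sin\!\left(\frac{4\pi t}{m}\right),\quad x_{4,t} = \cos\!\left(\frac{4\pi t}{m}\right),\ \dots \]

Successive sine/cosine pairs capture progressively higher-frequency seasonal patterns, so fewer terms than dummy variables are often sufficient when \(m\) is large.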
These Fourier terms are produced using the fourier() function. For example, the Australian beer data can be modelled like this.
fourier_beer <- recent_production |>
model(TSLM(Beer ~ trend() + fourier(K = 2)))
report(fourier_beer)
## Series: Beer
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.9029 -7.5995 -0.4594 7.9908 21.7895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 446.87920 2.87321 155.533 < 2e-16 ***
## trend() -0.34027 0.06657 -5.111 2.73e-06 ***
## fourier(K = 2)C1_4 8.91082 2.01125 4.430 3.45e-05 ***
## fourier(K = 2)S1_4 -53.72807 2.01125 -26.714 < 2e-16 ***
## fourier(K = 2)C2_4 -13.98958 1.42256 -9.834 9.26e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.23 on 69 degrees of freedom
## Multiple R-squared: 0.9243, Adjusted R-squared: 0.9199
## F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16
The K argument to fourier() specifies how many pairs of sine and cosine terms to include. The maximum allowed is \(K = m/2\), where \(m\) is the seasonal period. Because we have used the maximum here, the results are identical to those obtained when using seasonal dummy variables.
A common approach that is not recommended is to plot the forecast variable against a particular predictor and if there is no noticeable relationship, drop that predictor from the model. This is invalid because it is not always possible to see the relationship from a scatterplot, especially when the effects of other predictors have not been accounted for.
Another common approach which is also invalid is to do a multiple linear regression on all the predictors and disregard all variables whose \(p\)-values are greater than 0.05. To start with, statistical significance does not always indicate predictive value. Even if forecasting is not the goal, this is not a good strategy because the \(p\)-values can be misleading when two or more predictors are correlated with each other (see Section 7.8).
Instead, we will use a measure of predictive accuracy. Five such measures are introduced in this section. They can be shown using the glance() function, here applied to the model for US consumption:
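In the fable framework this is glance(fit_consMR); since that requires the fable package, here is a self-contained base-R sketch of the same idea using AIC(), BIC() and adjusted \(R^2\) on lm fits to the built-in mtcars data, a hypothetical stand-in for the consumption model.

```r
# Sketch: comparing candidate regressions by information criteria.
# mtcars is only an illustrative stand-in for us_change.
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

c(AIC(fit1), AIC(fit2))  # lower AIC is preferred
c(BIC(fit1), BIC(fit2))  # lower BIC is preferred
c(summary(fit1)$adj.r.squared,
  summary(fit2)$adj.r.squared)  # higher adjusted R-squared is preferred
```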
We compare these values against the corresponding values from other models. For the CV, AIC, AICc and BIC measures, we want to find the model with the lowest value; for Adjusted \(R^2\), we seek the model with the highest value.
Fig. 7.1: Predictive accuracy measures comparison chart.
Ex-ante forecasts are those that are made using only the information that is available in advance.
For example, ex-ante forecasts for the percentage change in US consumption for quarters following the end of the sample, should only use information that was available up to and including 2019 Q2.
In order to generate ex-ante forecasts, the model requires forecasts of the predictors.
Ex-post forecasts are those that are made using later information on the predictors.
For example, ex-post forecasts of consumption may use the actual observations of the predictors, once these have been observed.
These are not genuine forecasts, but are useful for studying the behaviour of forecasting models.
Normally, we cannot use actual future values of the predictor variables when producing ex-ante forecasts because their values will not be known in advance.
However, the special predictors introduced in Section 7.4 are all known in advance, as they are based on calendar variables (e.g., seasonal dummy variables or public holiday indicators) or deterministic functions of time (e.g. time trend).
In such cases, there is no difference between ex-ante and ex-post forecasts.
recent_production <- aus_production |>
filter(year(Quarter) >= 1992)
fit_beer <- recent_production |>
model(TSLM(Beer ~ trend() + season()))
fc_beer <- forecast(fit_beer)
fc_beer |>
autoplot(recent_production) +
labs(
title = "Forecasts of beer production using regression",
y = "megalitres"
)
In this setting, the forecaster assumes possible scenarios for the predictor variables that are of interest.
For example, a US policy maker may be interested in comparing the predicted change in consumption when there is a constant growth of 1% and 0.5% respectively for income and savings with no change in the employment rate, versus a respective decline of 1% and 0.5%, for each of the four quarters following the end of the sample.
fit_consBest <- us_change |>
model(
lm = TSLM(Consumption ~ Income + Savings + Unemployment)
)
future_scenarios <- scenarios(
Increase = new_data(us_change, 4) |>
mutate(Income=1, Savings=0.5, Unemployment=0),
Decrease = new_data(us_change, 4) |>
mutate(Income=-1, Savings=-0.5, Unemployment=0),
names_to = "Scenario")
fc <- forecast(fit_consBest, new_data = future_scenarios)
us_change |>
autoplot(Consumption) +
autolayer(fc) +
labs(title = "US consumption", y = "% change")
The great advantage of regression models is that they can be used to capture important relationships between the forecast variable of interest and the predictor variables.
However, for ex-ante forecasts, these models require future values of each predictor, which can be challenging.
If forecasting each predictor is too difficult, we may use scenario-based forecasting instead, where we assume specific future values for all predictors.
Example
fit_cons <- us_change |>
model(TSLM(Consumption ~ Income))
new_cons <- scenarios(
"Average increase" = new_data(us_change, 4) |>
mutate(Income = mean(us_change$Income)),
"Extreme increase" = new_data(us_change, 4) |>
mutate(Income = 12),
names_to = "Scenario"
)
fcast <- forecast(fit_cons, new_cons)
us_change |>
autoplot(Consumption) +
autolayer(fcast) +
labs(title = "US consumption", y = "% change")
Although the linear relationship assumed so far in this chapter is often adequate, there are many cases in which a nonlinear functional form is more suitable. To keep things simple in this section we assume that we only have one predictor \(x\).
The simplest way of modelling a nonlinear relationship is to transform the forecast variable \(y\) and/or the predictor variable \(x\) before estimating a regression model. While this provides a non-linear functional form, the model is still linear in the parameters. The most commonly used transformation is the (natural) logarithm (see Section 3.1).
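For example, the log-log functional form

\[ \log y = \beta_0 + \beta_1 \log x + \varepsilon \]

is still linear in the parameters, and the slope \(\beta_1\) can be interpreted as an elasticity: the average percentage change in \(y\) resulting from a 1% increase in \(x\).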
Example: Boston marathon winning times
boston_men <- boston_marathon |>
filter(Year >= 1924) |>
filter(Event == "Men's open division") |>
mutate(Minutes = as.numeric(Time)/60)
boston_men
Let’s plot Year and Minutes with a fitted trend line (‘lm’).
boston_men %>%
ggplot(aes(Year, Minutes)) +
geom_line(linewidth = 1) +
geom_smooth(method = 'lm', se = FALSE, color = 'steelblue') +
labs(x = 'Year',
y = 'Minutes',
title = 'Boston marathon winning times')
Now, let’s create the same plot, but with a ‘loess’ smoother.
boston_men %>%
ggplot(aes(Year, Minutes)) +
geom_line(linewidth = 1) +
geom_smooth(method = 'loess', se = FALSE, color = 'steelblue') +
labs(x = 'Year',
y = 'Minutes',
title = 'Boston marathon winning times')
It is clear that a linear trend is not a good fit for the observed data: the earlier downward trend does not continue after about 1980.
Residual analysis from the linear model
fit_trends <- boston_men |>
model(
linear = TSLM(Minutes ~ trend()),
exponential = TSLM(log(Minutes) ~ trend()),
piecewise = TSLM(Minutes ~ trend(knots = c(1950, 1980)))
)
fc_trends <- fit_trends |> forecast(h = 10)
boston_men |>
autoplot(Minutes) +
geom_line(data = fitted(fit_trends),
aes(y = .fitted, colour = .model)) +
autolayer(fc_trends, alpha = 0.5, level = 95) +
labs(y = "Minutes",
title = "Boston marathon winning times")
The best forecasts appear to come from the piecewise linear trend.
Correlation is not causation
It is important not to confuse correlation with causation, or causation with forecasting. A variable \(x\) may be useful for forecasting a variable \(y\), but that does not mean \(x\) is causing \(y\). It is possible that \(x\) is causing \(y\), but it may be that \(y\) is causing \(x\), or that the relationship between them is more complicated than simple causality.
We describe a variable that is not included in our forecasting model as a confounder when it influences both the response variable and at least one predictor variable. Confounding makes it difficult to determine what variables are causing changes in other variables, but it does not necessarily make forecasting more difficult.
When two or more predictors are highly correlated it is always challenging to accurately separate their individual effects.
Having correlated predictors is not really a problem for forecasting, as we can still compute forecasts without needing to separate out the effects of the predictors.
However, it becomes a problem with scenario forecasting as the scenarios should take account of the relationships between predictors. It is also a problem if some historical analysis of the contributions of various predictors is required.
A closely related issue is multicollinearity, which occurs when similar information is provided by two or more of the predictor variables in a multiple regression.
It can occur when two predictors are highly correlated with each other (that is, they have a correlation coefficient close to +1 or -1).
In this case, knowing the value of one of the variables tells you a lot about the value of the other variable.
Hence, they are providing similar information. For example, foot size can be used to predict height, but including the size of both left and right feet in the same model is not going to make the forecasts any better, although it won’t make them worse either.
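The foot-size point can be sketched numerically: with two near-duplicate predictors, the coefficient standard errors balloon even though the fitted values (and hence forecasts) are barely affected. All data and names here are simulated and hypothetical.

```r
# Sketch: multicollinearity inflates coefficient standard errors.
set.seed(3)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)  # x2 is almost a copy of x1
y  <- 2 * x1 + rnorm(100)

cor(x1, x2)  # correlation close to +1

fit_both <- lm(y ~ x1 + x2)  # collinear predictors
fit_one  <- lm(y ~ x1)       # single predictor

# Standard errors for x1 are inflated in fit_both relative to fit_one,
# even though both models give essentially the same fitted values.
summary(fit_both)$coefficients[, "Std. Error"]
summary(fit_one)$coefficients[, "Std. Error"]
```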